Skip to content

Implement fallback to WireServer when deploying SSH keys#3603

Merged
narrieta merged 9 commits into
Azure:developfrom
narrieta:certificates
Jun 12, 2026
Merged

Implement fallback to WireServer when deploying SSH keys#3603
narrieta merged 9 commits into
Azure:developfrom
narrieta:certificates

Conversation

@narrieta

Copy link
Copy Markdown
Member

As part of #3484, the waagent provision handler no longer pulls the entire goal state. As a side effect, when SSH keys are not available in ovfenv.xml, the fallback to the WireServer's Certificates package is no longer available. This PR adds that fallback.

Copilot AI review requested due to automatic review settings April 29, 2026 20:21

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds back the ability for the default (waagent) provisioning flow to retrieve SSH key material from the WireServer Certificates package when ovf-env.xml does not contain the full public key value, aligning behavior with pre-#3484 provisioning.

Changes:

  • Invoke a new _download_ssh_keys() step during user account configuration to fetch the Certificates goal state property when needed for SSH key deployment.
  • Add unit tests intended to verify when the Certificates package is (and isn’t) fetched.
  • Update existing ovf-env test fixtures to remove <KeyPairs> so unrelated provisioning tests don’t implicitly require certificate downloads.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
azurelinuxagent/pa/provision/default.py Adds conditional WireServer certificates download prior to deploying SSH keys.
tests/pa/test_provision.py Adds tests for the new certificate-download decision logic using mock_wire_protocol.
tests/data/ovf-env.xml Removes <KeyPairs> from baseline fixture to avoid triggering certificate download during unrelated tests.
tests/data/ovf-env-2.xml Same as above for the “pga_true” fixture.
tests/data/ovf-env-4.xml Same as above for the “pga_bad” fixture.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/pa/test_provision.py
Comment thread tests/pa/test_provision.py Outdated
Comment thread azurelinuxagent/pa/provision/default.py
Comment thread azurelinuxagent/pa/provision/default.py Outdated
self.deploy_ssh_keypairs(ovfenv)

def _download_ssh_keys(self, ovfenv):
#

@narrieta narrieta Apr 29, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the reason of this check, see the implementation of self.deploy_ssh_pubkeys() and self.deploy_ssh_keypairs(). In summary, the former will look for /var/lib/waagent/*.crt when ovfenv.xml doesn't have a value for the key, and the latter will look for /var/lib/waagent/*.prv when ovfenv.xml contains any key pairs.

For the format of ovfenv.xml, see the unit tests.

if not download_certificates:
return

#

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally, the PA would retrieve the goal state as part of protocol detection, which retries 360 times, with a 10 second period. If that fails, the process exits and is restarted by the system, which goes again into the 360 retries, etc.

I feel just a few tries are enough, let me know if you think otherwise; thanks.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked that cloud-init also retries for a long period of time, so I decided to use the same strategy as protocol detection

Comment thread tests/data/ovf-env-2.xml
<Value>ssh-rsa AAAANOTAREALKEY== foo@bar.local</Value>
</PublicKey>
</PublicKeys>
<KeyPairs>

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key pairs in the unit tests are not used, and with the new code they would trigger the Certificates download. To fix this, quite a few tests need to be rewritten to mock that download. Since the keys are not used by the tests, I decided to remove them instead.

@codecov

codecov Bot commented Apr 29, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.46%. Comparing base (3aebcdd) to head (bbc1df9).
⚠️ Report is 561 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3603      +/-   ##
===========================================
+ Coverage    71.97%   73.46%   +1.48%     
===========================================
  Files          103      122      +19     
  Lines        15692    18608    +2916     
  Branches      2486     2488       +2     
===========================================
+ Hits         11295    13670    +2375     
- Misses        3881     4341     +460     
- Partials       516      597      +81     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread tests/pa/test_provision.py
Comment thread azurelinuxagent/pa/provision/default.py Outdated
self.deploy_ssh_pubkeys(ovfenv)
self.deploy_ssh_keypairs(ovfenv)

def _download_ssh_keys(self, ovfenv):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since its not download always, can we rename to _download_ssh_keys_if_needed or something like that?

<SSH>
<PublicKeys>
<PublicKey>
<Fingerprint>EB0C0AB4B2D5FC35F2F0658D19F44C8283E2DD62</Fingerprint>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it expected that the customer should only provide one of fingerprint and value, or is it acceptable to provide both?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original sample file from which I copied this one includes both. In practice, I believe only 1 would be populated.

Now, the logic in OSUtil.deploy_ssh_pubkey() prioritizes the value over the thumbprint, so in terms of the unit test that uses this new file, having both is a good test case.

Copilot AI review requested due to automatic review settings June 8, 2026 19:09

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Comment thread tests/pa/test_provision.py
Comment thread tests/pa/test_provision.py Outdated
Comment on lines +446 to +451
with TestProvision._create_provision_handler_with_mock_protocol() as (handler, protocol):
protocol.set_http_handlers(http_get_handler=mock_http_get)
with patch('azurelinuxagent.pa.provision.default.PROBE_INTERVAL', 0): # set the delay between retries to 0
handler._download_ssh_keys_if_needed(ovfenv)
self.assertEqual(2 * MAX_RETRY, mock_http_get.call_count, "Expected maximum number of retries ({0}) to have been attempted".format(MAX_RETRY)) # times 2 since two cyphers are attempted for FIPS support
ssh_key_path = os.path.join(conf.get_lib_dir(), '8979F1AC8C4215827BF3B5A403E6137B504D02A4.crt')

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

won't fix, test takes .3 secs to execute and want to avoid excessive use of mocks

Comment thread azurelinuxagent/common/protocol/goal_state.py Outdated
Copilot AI review requested due to automatic review settings June 8, 2026 23:43

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Comment thread tests/pa/test_provision.py Outdated
Comment on lines 547 to +551
# A failure to download the certificates won't necessarily produce an error. Certificates do not change often and they may have
# already been saved to disk on a previous goal state. We simply report the error and continue processing the goal_state; later on,
# already been saved to disk on a previous goal state. Re-raise the exception only if explicitly requested via the ignore_download_errors
# parameter, otherwise simply report the error and continue processing the goal_state; later on,
# before extensions are processed, the Agent checks whether the required certificates are already on disk and refreshes the goal
# state if they are not.
# state if they are not

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

won't fix; use of a boolean should be easy to understand by future maintaners


xml_doc = parse_doc(xml_text)
data = findtext(xml_doc, "Data")
if data is None:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are download failures treated differently than missing data element?

Right now we continue in the case of missing data, but raise in the case of download/decryption errors (when ignore_download_error is False).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They would be different: a download error vs a successful request with no data. The way the code looks, it seems that it is (or it was) possible to get a response without any data. I've never seen such a response.

Copilot AI review requested due to automatic review settings June 12, 2026 14:34

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.

@narrieta narrieta merged commit c1988cf into Azure:develop Jun 12, 2026
14 checks passed
@narrieta narrieta deleted the certificates branch June 12, 2026 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants